
feat(tools): truncate large tool results with workspace spill #489

Open
TatsuKo-Tsukimi wants to merge 3 commits into dataelement:main from TatsuKo-Tsukimi:feat/tool-result-truncation

Conversation

@TatsuKo-Tsukimi

Summary

Tool calls returning ~50KB of search hits or PDF extracts used to enter the in-context history verbatim, burning tokens for content the model samples one slice of. New services/tool_result_truncation.py spills oversized payloads to agent_data/<agent_id>/_tool_results/<call_id>.txt and replaces the in-context body with a head excerpt + a marker pointing the model at read_file for retrieval.

Threshold: 4000 estimated tokens (~12KB at chars/3) — soft-start to give the model time to learn the read_file follow-up; can tighten to ~2000 once telemetry confirms reliable use of the marker.

Smart head:

  • JSON-shape responses with a known array key (results/items/data/entries/hits/documents) keep metadata + first 5 items + a synthetic _truncated_<key>_count field, so search/list responses stay useful
  • Otherwise plain head-cut (preferring a line boundary when one is near the cut point)
  • Vision-injected list payloads (image_data markers) pass through

Sandbox: spill files live under each agent's existing AGENT_DATA_DIR sandbox; cross-agent leak is prevented by the existing read_file boundary.
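The flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `truncate_tool_result`, `TRUNCATION_THRESHOLD_TOKENS`, `ARRAY_KEYS`, and the marker wording are all assumptions inferred from the description.

```python
import json
from pathlib import Path

TRUNCATION_THRESHOLD_TOKENS = 4000
ARRAY_KEYS = ("results", "items", "data", "entries", "hits", "documents")

def estimate_tokens(text: str) -> int:
    # chars/3 deliberately overestimates, so truncation kicks in a little early.
    return len(text) // 3

def truncate_tool_result(content: str, call_id: str, workspace: Path) -> str:
    if estimate_tokens(content) <= TRUNCATION_THRESHOLD_TOKENS:
        return content  # pass-through for small results

    # Spill the full payload to the agent's sandbox.
    spill_dir = workspace / "_tool_results"
    spill_dir.mkdir(parents=True, exist_ok=True)
    spill_path = spill_dir / f"{call_id}.txt"
    spill_path.write_text(content, encoding="utf-8")

    head = _smart_head(content)
    return (f"{head}\n\n[TRUNCATED: full result saved to "
            f"_tool_results/{call_id}.txt; use read_file to retrieve it]")

def _smart_head(content: str, max_chars: int = 2000) -> str:
    # JSON-shape responses with a known array key keep metadata + first 5 items.
    try:
        payload = json.loads(content)
        if isinstance(payload, dict):
            for key in ARRAY_KEYS:
                if isinstance(payload.get(key), list):
                    out = {k: v for k, v in payload.items() if k != key}
                    out[key] = payload[key][:5]
                    out[f"_truncated_{key}_count"] = len(payload[key])
                    return json.dumps(out, ensure_ascii=False)
    except ValueError:
        pass
    # Otherwise a plain head-cut, preferring a nearby line boundary.
    cut = content.rfind("\n", 0, max_chars)
    return content[: cut if cut > max_chars // 2 else max_chars]
```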

Changes

Applied at three sites to prevent drift:

  • llm/caller.py:_process_tool_call (call_llm path)
  • llm/caller.py:call_agent_llm_with_tools inner loop
  • services/heartbeat.py tool loop

agent_context.py rule 3 augmented with a paragraph teaching the model to recognize the marker and use read_file for full content rather than fabricating from prior knowledge.

16 new unit tests (pass-through, spill, utf-8 round-trip, threshold boundary, JSON-shape preservation, write-failure fallback, realistic 50-result jina_search end-to-end).

Stack

PR 3/4 of context-budget hygiene series.

Test plan

  • pytest backend/tests/test_tool_result_truncation.py — 16/16 pass
  • Reviewer: is the 4000-token threshold too aggressive for jina_search returns? Should it be customized per tool?
  • Reviewer: spill files accumulate forever — need a GC story or retention policy follow-up?
  • Reviewer: confirm the sandbox boundary (agent A can't read agent B's _tool_results/) holds under A2A relationship paths

Note: 3 tool loops still exist (caller.py:432, caller.py:763, heartbeat.py:335). Drift prevention here is partial — the next PR consolidates them into a single execute_tool_calls helper.

🤖 Generated with Claude Code

Replace naive `conversation[-ctx_size:]` slicing with a walker that treats
assistant.tool_calls and its matching role="tool" messages as one atomic
block. Naive slicing can leave an orphan role="tool" at the head when the
cut lands mid-pair — OpenAI rejects this with "No tool call found for
function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks tail
backward, identifies blocks, and includes blocks whole or not at all.
Orphan tools are silently dropped regardless of budget.
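A sketch of the block-aware walker described above; the signature and grouping details are assumptions, not the actual `services/history_window` implementation.

```python
def truncate_by_message_count(messages: list[dict], budget: int) -> list[dict]:
    # Group messages into atomic blocks: an assistant message carrying
    # tool_calls plus its following role="tool" replies form one block.
    blocks, i = [], 0
    while i < len(messages):
        msg = messages[i]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            j = i + 1
            while j < len(messages) and messages[j].get("role") == "tool":
                j += 1
            blocks.append(messages[i:j])
            i = j
        elif msg.get("role") == "tool":
            i += 1  # orphan tool reply: drop silently regardless of budget
        else:
            blocks.append([msg])
            i += 1

    # Walk blocks from the tail; include each block whole or not at all.
    kept, used = [], 0
    for block in reversed(blocks):
        if used + len(block) > budget:
            break
        kept.append(block)
        used += len(block)
    return [m for block in reversed(kept) for m in block]
```

Because a tool_calls/tool pair counts as one unit, a budget cut can never land mid-pair and produce the orphan `role="tool"` that OpenAI rejects.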

Replaces head-only pop guard in:
  - app/api/websocket.py (web chat)
  - app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply
path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls,
multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b880d1ff67


```python
        return tool_content

    # Spill full content to disk
    full_path = agent_workspace / "_tool_results" / f"{call_id}.txt"
```

P1: Use a unique spill filename per tool invocation

Writing spill files as _tool_results/{call_id}.txt can overwrite earlier truncated outputs because call_id is not globally unique across rounds for all providers (for example, the Gemini adapters synthesize IDs like call_1, call_2 each response in backend/app/services/llm/client.py). When a later tool call reuses the same ID, an older in-context marker now points to unrelated content, so a subsequent read_file retrieves the wrong payload and can corrupt the agent’s reasoning/output.
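One possible fix, sketched: suffix the provider-supplied `call_id` with a short random token so re-used IDs (such as the synthesized `call_1`, `call_2`) cannot collide across rounds. The helper name is hypothetical.

```python
import uuid

def unique_spill_name(call_id: str) -> str:
    # Keep the call_id prefix for debuggability; the uuid suffix guarantees
    # uniqueness even when a provider reuses IDs across responses.
    return f"{call_id}_{uuid.uuid4().hex[:8]}.txt"
```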


@TatsuKo-Tsukimi force-pushed the feat/tool-result-truncation branch from b880d1f to 1cff15b on April 27, 2026 at 08:45

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cff15b156



```python
try:
    spill_root.mkdir(parents=True, exist_ok=True)
    full_path.write_text(tool_content, encoding="utf-8")
```

P2: Bound growth of spilled tool-result files

Each oversized tool call writes a full payload file under _tool_results/ and there is no retention or cleanup path anywhere in the new truncation flow, so long-running agents that frequently call search/PDF tools will accumulate files without bound until AGENT_DATA_DIR fills up. Once disk pressure hits, subsequent spills fall back to partial excerpts and the model can no longer retrieve full results, so this becomes a production reliability issue rather than just storage overhead.
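A possible follow-up, sketched as a TTL sweep run on each new spill. Everything here is an assumption (the function name, the 24-hour TTL, and the sweep-on-write trigger); the PR itself ships no retention policy.

```python
import time
from pathlib import Path

def sweep_spill_dir(spill_dir: Path, ttl_seconds: int = 24 * 3600) -> int:
    """Remove spill files whose mtime is older than ttl_seconds; return count."""
    if not spill_dir.is_dir():
        return 0
    cutoff = time.time() - ttl_seconds
    removed = 0
    for f in spill_dir.glob("*.txt"):
        if f.stat().st_mtime < cutoff:
            f.unlink(missing_ok=True)
            removed += 1
    return removed
```

A sweep keyed on mtime is deliberately conservative: a file the agent recently re-read via `read_file` keeps its mtime only if reads touch it, so an atime- or registry-based policy could retain hot files longer at the cost of more bookkeeping.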


TatsuKo-Tsukimi and others added 2 commits April 27, 2026 17:51
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.

Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
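The dual-bound idea can be sketched as below, estimating each message's cost from its JSON-serialized length at chars/3. For brevity this sketch walks single messages; the real helper additionally keeps tool_calls/tool pairs atomic via the shared walker, and the function signature here is an assumption.

```python
import json

def estimate_tokens_from_chars(n_chars: int) -> int:
    # chars/3 overestimates safely for most tokenizers.
    return n_chars // 3

def message_cost(msg: dict) -> int:
    return estimate_tokens_from_chars(len(json.dumps(msg, ensure_ascii=False)))

def truncate_by_token_budget(messages, token_budget, max_messages):
    # Walk from the tail; stop when either bound would be exceeded.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = message_cost(msg)
        if used + cost > token_budget or len(kept) >= max_messages:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```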

Changes:
  - models/agent.py: + context_window_tokens (Integer, default=50000)
    + DEFAULT_CONTEXT_WINDOW_TOKENS constant
  - schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
  - alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
  - services/history_window.py: + truncate_by_token_budget, refactored
    common walker, JSON-serialized char->token estimate via existing
    estimate_tokens_from_chars (chars/3 — overestimates safely)
  - api/websocket.py: pass tok_budget to helper, raise DB load to
    max(ctx_size, 500) so helper has room to choose
  - api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
  - frontend: AgentDetail Settings slider + i18n + types

10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.

Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tool calls returning ~50 KB of search hits or PDF extracts used to enter
the in-context history verbatim, burning tokens for content the model
samples one slice of. New `services/tool_result_truncation.py` spills
oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt`
and replaces the in-context body with a head excerpt + a marker pointing
the model at `read_file` for retrieval.

Threshold: 4000 estimated tokens (~12 KB at chars/3) — soft-start to give
the model time to learn the read_file follow-up; can tighten to ~2000
once telemetry confirms reliable use of the marker.

Smart head:
  - JSON-shape responses with a known array key (results/items/data/
    entries/hits/documents) keep metadata + first 5 items + a synthetic
    _truncated_<key>_count field, so search/list responses stay useful.
  - Otherwise plain head-cut (preferring a line boundary when one is
    near the cut point).
  - Vision-injected list payloads (image_data markers) pass through.

Sandbox: spill files live under each agent's existing AGENT_DATA_DIR
sandbox; cross-agent leak is prevented by the existing read_file boundary.

Applied at three sites to prevent drift:
  - llm/caller.py _process_tool_call (call_llm path)
  - llm/caller.py call_agent_llm_with_tools inner loop
  - services/heartbeat.py tool loop

agent_context.py rule 3 augmented with a paragraph teaching the model to
recognize the marker and use read_file for full content rather than
fabricating from prior knowledge.

16 new unit tests covering pass-through, spill, utf-8 round-trip,
threshold boundary, JSON-shape preservation, write-failure fallback, and
a realistic 50-result jina_search end-to-end. 41/41 total pass.

Note: 3 tool loops still exist (caller.py:432, caller.py:763,
heartbeat.py:335). Drift prevention is partial here — PR 3 consolidates
them into a single execute_tool_calls helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TatsuKo-Tsukimi force-pushed the feat/tool-result-truncation branch from 1cff15b to b01c0f1 on April 27, 2026 at 09:51
